Google Books N-gram Corpus used as a Grammar Checker
نویسندگان
چکیده
In this research we explore the possibility of using a large n-gram corpus (Google Books) to derive lexical transition probabilities from the frequency of word n-grams and then use them to check and suggest corrections in a target text without the need for grammar rules. We conduct several experiments in Spanish, although our conclusions also reach other languages since the procedure is corpus-driven. The paper reports on experiments involving different types of grammar errors, which are conducted to test different grammar-checking procedures, namely, spotting possible errors, deciding between different lexical possibilities and filling-in the blanks in a text.
منابع مشابه
Developing an Unsupervised Grammar Checker for Filipino Using Hybrid N-grams as Grammar Rules
This study focuses on using hybrid n-grams as grammar rules for detecting grammatical errors and providing corrections in Filipino. These grammar rules are derived from grammatically-correct and tagged texts which are made up of part-of-speech (POS) tags, lemmas, and surface words sequences. Due to the structure of the rules used by this system, it presents an opportunity to have an unsupervise...
متن کاملDetecting Grammatical Errors in Text using a Ngram-based Ruleset
Applications like word processors and other writing tools typically include a grammar checker. The purpose of a grammar checker is to identify sentences that are grammatically incorrect based on the syntax of the language. The proposed grammar checker is a rule-based system to identify sentences that are most likely to contain errors. The set of rules are automatically generated from a part of ...
متن کاملSyntactic Annotations for the Google Books NGram Corpus
We present a new edition of the Google Books Ngram Corpus, which describes how often words and phrases were used over a period of five centuries, in eight languages; it reflects 6% of all books ever published. This new edition introduces syntactic annotations: words are tagged with their part-of-speech, and headmodifier relationships are recorded. The annotations are produced automatically with...
متن کاملWeb-scale Surface and Syntactic n-gram Features for Dependency Parsing
We develop novel firstand second-order features for dependency parsing based on the Google Syntactic Ngrams corpus, a collection of subtree counts of parsed sentences from scanned books. We also extend previous work on surface n-gram features from Web1T to the Google Books corpus and from first-order to second-order, comparing and analysing performance over newswire and web treebanks. Surface a...
متن کاملLinked Open Data and Web Corpus Data for noun compound bracketing
This research provides a comparison of a linked open data resource (DBpedia) and web corpus data resources (Google Web Ngrams and Google Books Ngrams) for noun compound bracketing. Large corpus statistical analysis has often been used for noun compound bracketing, and our goal is to introduce a linked open data (LOD) resource for such task. We show its particularities and its performance on the...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2012